FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow
Reconstruction of 3D neural fields from posed images has emerged as a
promising method for self-supervised representation learning. The key challenge
preventing the deployment of these 3D scene learners on large-scale video data
is their dependence on precise camera poses from structure-from-motion, which
is prohibitively expensive to run at scale. We propose a method that jointly
reconstructs camera poses and 3D neural scene representations online and in a
single forward pass. We estimate poses by first lifting frame-to-frame optical
flow to 3D scene flow via differentiable rendering, preserving locality and
shift-equivariance of the image processing backbone. SE(3) camera pose
estimation is then performed via a weighted least-squares fit to the scene flow
field. This formulation enables us to jointly supervise pose estimation and a
generalizable neural scene representation via re-rendering the input video, and
thus, train end-to-end and fully self-supervised on real-world video datasets.
We demonstrate that our method performs robustly on diverse, real-world video,
notably on sequences traditionally challenging to optimization-based pose
estimation techniques.
Project website: http://cameronosmith.github.io/flowca
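
The SE(3) step described above (a weighted least-squares fit of a rigid motion to a scene flow field) has a standard closed form. The sketch below is illustrative only, not the paper's implementation: given 3D points lifted from one frame, the same points displaced by the estimated scene flow, and per-point confidence weights, a weighted Procrustes/Kabsch solve recovers the rotation and translation minimizing the weighted squared residual.

    import numpy as np

    def weighted_se3_fit(p, q, w):
        # p, q: (N, 3) corresponding 3D points (e.g. lifted surface points and
        # the same points displaced by scene flow); w: (N,) confidence weights.
        # Returns R (3x3), t (3,) minimizing sum_i w_i * ||R @ p_i + t - q_i||^2.
        w = w / w.sum()
        p_bar = (w[:, None] * p).sum(axis=0)        # weighted centroids
        q_bar = (w[:, None] * q).sum(axis=0)
        P, Q = p - p_bar, q - q_bar
        H = (w[:, None] * P).T @ Q                  # weighted cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = q_bar - R @ p_bar
        return R, t

In the paper this solve is made differentiable so that the pose estimate can be supervised end-to-end through the re-rendering loss; the NumPy routine and function name above are assumptions for illustration.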
Variational Barycentric Coordinates
We propose a variational technique to optimize for generalized barycentric
coordinates that offers additional control compared to existing models. Prior
work represents barycentric coordinates using meshes or closed-form formulae,
in practice limiting the choice of objective function. In contrast, we directly
parameterize the continuous function that maps any coordinate in a polytope's
interior to its barycentric coordinates using a neural field. This formulation
is enabled by our theoretical characterization of barycentric coordinates,
which allows us to construct neural fields that parameterize the entire
function class of valid coordinates. We demonstrate the flexibility of our
model using a variety of objective functions, including multiple smoothness and
deformation-aware energies; as a side contribution, we also present
mathematically-justified means of measuring and minimizing objectives like
total variation on discontinuous neural fields. We offer a practical
acceleration strategy, present a thorough validation of our algorithm, and
demonstrate several applications.
Project website: https://anadodik.github.io
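
As a loose sketch of the parameterization idea only (not the authors' construction, which characterizes and enforces the full class of valid coordinates by design), one could represent the coordinate function with a small MLP whose softmax output guarantees non-negativity and partition of unity, and penalize violation of the linear-reproduction constraint during optimization.

    import torch
    import torch.nn as nn

    class BarycentricField(nn.Module):
        # Toy neural field mapping a query point x inside a polytope with n
        # vertices to weights lambda(x). Softmax enforces lambda >= 0 and
        # sum_i lambda_i = 1; linear reproduction is only penalized below,
        # unlike the paper's construction, which satisfies it exactly.
        def __init__(self, n_vertices, dim=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_vertices),
            )

        def forward(self, x):                       # x: (batch, dim)
            return torch.softmax(self.net(x), dim=-1)

    def reproduction_penalty(lam, vertices, x):
        # || sum_i lambda_i(x) v_i - x ||^2, averaged over the batch.
        return ((lam @ vertices - x) ** 2).sum(dim=-1).mean()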
Movie Editing and Cognitive Event Segmentation in Virtual Reality Video
Traditional cinematography has relied for over a century on a
well-established set of editing rules, called continuity editing, to create a
sense of situational continuity. Despite massive changes in visual content
across cuts, viewers in general experience no trouble perceiving the
discontinuous flow of information as a coherent set of events. However, Virtual
Reality (VR) movies are intrinsically different from traditional movies in that
the viewer controls the camera orientation at all times. As a consequence,
common editing techniques that rely on camera orientations, zooms, etc., cannot
be used. In this paper we investigate key questions about how well traditional
movie editing carries over to VR. To do so, we rely on recent
cognition studies and the event segmentation theory, which states that our
brains segment continuous actions into a series of discrete, meaningful events.
We first replicate one of these studies to assess whether the predictions of
this theory can be applied to VR. We next gather gaze data from viewers
watching VR videos containing different edits with varying parameters, and
provide the first systematic analysis of viewers' behavior and the perception
of continuity in VR. From this analysis we make a series of relevant findings;
for instance, our data suggests that predictions from the cognitive event
segmentation theory are useful guides for VR editing; that different types of
edits are equally well understood in terms of continuity; and that spatial
misalignments between regions of interest at the edit boundaries favor a more
exploratory behavior even after viewers have fixated on a new region of
interest. In addition, we propose a number of metrics to describe viewers'
attentional behavior in VR. We believe the insights derived from our work can
be useful as guidelines for VR content creation.
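
One hypothetical example of such an attentional metric (the specific definition here is ours, not one of the paper's) is the time a viewer takes to converge on the new region of interest after a cut, computed directly from timestamped gaze samples.

    import numpy as np

    def time_to_roi(gaze_yaw_pitch, timestamps, roi_center, roi_radius, cut_time):
        # gaze_yaw_pitch: (N, 2) gaze directions in degrees; timestamps: (N,)
        # seconds; roi_center: (2,) yaw/pitch of the new region of interest;
        # roi_radius: angular tolerance in degrees; cut_time: time of the edit.
        # Returns seconds from the cut to the first gaze sample inside the ROI,
        # or None if the viewer never fixates it (yaw wraparound ignored).
        after = timestamps >= cut_time
        dist = np.linalg.norm(gaze_yaw_pitch[after] - roi_center, axis=1)
        hits = np.nonzero(dist <= roi_radius)[0]
        if hits.size == 0:
            return None
        return timestamps[after][hits[0]] - cut_time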
Approaching human 3D shape perception with neurally mappable models
Humans effortlessly infer the 3D shape of objects. What computations underlie
this ability? Although various computational models have been proposed, none of
them capture the human ability to match object shape across viewpoints. Here,
we ask whether and how this gap might be closed. We begin with a relatively
novel class of computational models, 3D neural fields, which encapsulate the
basic principles of classic analysis-by-synthesis in a deep neural network
(DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D
matching judgments well aligned to humans for within-category comparisons,
adversarially-defined comparisons that accentuate the 3D failure cases of
standard DNN models, and adversarially-defined comparisons for algorithmically
generated shapes with no category structure. We then investigate the source of
the 3D-LFN's ability to achieve human-aligned performance through a series of
computational experiments. Exposure to multiple viewpoints of objects during
training and a multi-view learning objective are the primary factors behind
model-human alignment; even conventional DNN architectures come much closer to
human behavior when trained with multi-view objectives. Finally, we find that
while the models trained with multi-view learning objectives are able to
partially generalize to new object categories, they fall short of human
alignment. This work provides a foundation for understanding human shape
inferences within neurally mappable computational architectures and highlights
important questions for future work.
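
A minimal sketch of the kind of multi-view objective credited above, assuming a generic encoder/renderer pair rather than the exact 3D-LFN training setup: an encoder infers a latent code from one view of an object, a renderer reconstructs a second view of the same object from that code, and the photometric error between the rendering and the ground-truth view drives learning, forcing a viewpoint-consistent representation.

    import torch
    import torch.nn.functional as F

    def multi_view_loss(encoder, renderer, view_a, view_b, rays_b):
        # view_a, view_b: two images of the same object from different cameras;
        # rays_b: camera rays for view_b. The latent inferred from view_a must
        # explain view_b. The encoder/renderer names and signatures here are
        # assumptions for illustration, not the paper's API.
        z = encoder(view_a)                     # per-object latent code
        pred_b = renderer(z, rays_b)            # render the second viewpoint
        return F.mse_loss(pred_b, view_b)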